FEOpti-ACVP: identification of novel anti-coronavirus peptide sequences based on feature engineering and optimization

Abstract Anti-coronavirus peptides (ACVPs) represent a relatively novel approach to inhibiting the adsorption and fusion of the virus with human cells. Several peptide-based inhibitors have shown promise as potential therapeutic drug candidates. However, identifying such peptides in laboratory experiments is both costly and time-consuming. Therefore, there is growing interest in using computational methods to predict ACVPs. Here, we describe a model for the prediction of ACVPs that is based on the combination of feature engineering (FE) optimization and deep representation learning. FEOpti-ACVP was pre-trained using two feature extraction frameworks. In the next step, several machine learning approaches were tested to construct the final algorithm. The final version of FEOpti-ACVP outperformed existing methods used for ACVP prediction, and it has the potential to become a valuable tool in ACVP drug design. A user-friendly webserver for FEOpti-ACVP can be accessed at http://servers.aibiochem.net/soft/FEOpti-ACVP/.


INTRODUCTION
Coronaviruses contain a positive-strand RNA genome that, at 27 to 32 kb, is the largest of all known RNA viruses [1]. Once the virus enters the host cell, a series of non-structural proteins forms a large supramolecular protein structure referred to as the replication and transcription complex (RTC). The RTC contains polymerase, primase, helicase, methyltransferase and nuclease enzymes, as well as various cofactor proteins, all of which are essential for the core processes of viral transcription and replication. Not surprisingly, it contains several potential targets for antiviral drug design [2]. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was discovered in late 2019 [3]. It belongs to the Betacoronavirus genus within the Coronaviridae family. The pandemic caused by this virus killed millions of people globally and led to lasting health problems in a proportion of patients who recovered. Betacoronaviruses were also responsible for the severe acute respiratory syndrome (SARS) epidemic in 2003 and the outbreak of Middle East respiratory syndrome (MERS) in 2012. Given this history, it is essential to develop novel strategies to combat infections caused by this class of viruses.
The difficulty of developing small-molecule drugs, combined with the growing maturity of genetic engineering and solid-phase peptide synthesis technology, resulted in the emergence of therapeutic peptide drugs [4][5][6][7][8][9]. This interest is also fueled by the wide adaptability, high safety and remarkable efficacy of this approach. In particular, peptide-based pan-coronavirus fusion inhibitors can effectively inhibit infections by SARS-CoV-2 and its variants, as well as other human coronaviruses [10][11][12][13][14][15][16]. Although laboratory experiments can identify novel anti-coronavirus peptides (ACVPs), this process is expensive and time-consuming. Therefore, there is increasing interest in the use of computational approaches to predict peptides with potential ACVP activity.
Several methods have been developed to predict the sequences of potential ACVPs. In 2021, Pang et al. [17] described PreAntiCoV, a two-stage classification approach that takes into account the amino acid composition (AAC), dipeptide composition (DiC), composition of k-spaced amino acid group pairs (CKSAAGP), pseudo amino acid composition (PAAC) and physicochemical features (PHYC) to form peptide descriptors, combined with a random forest (RF) classifier trained on a balanced dataset. In the same year, Timmons et al. [18] proposed ENNAVIA to predict antiviral peptides and ACVPs. This approach used amino acid and other physicochemical descriptors for feature extraction and combined these with machine learning (ML) utilizing deep neural networks, which achieved better predictive performance than PreAntiCoV [19]. In 2022, Kurata et al. [20] introduced iACVP, an approach that used a word embedding model, word2vec (W2V), to extract peptide sequence features. This was combined with several ML methods, including transformer, convolutional neural network (CNN), bidirectional long short-term memory (BiLSTM), RF and support vector machine (SVM), to develop classifiers [21][22][23][24]. In 2023, Chen et al. [25] proposed PACVP, a peptide prediction strategy that employed nine peptide sequence characteristics to extract relevant features identifying potential ACVPs. These included the previously tried AAC, CKSAAGP, PAAC and DiC. In addition, they also incorporated adaptive skip dipeptide composition (ASDC), the composition, transition and distribution (CTD) tool, the quasi-sequence-order descriptor (QSO), amino acid entropy (AAE), Boruta, the max-relevance-max-distance (MRMD) algorithm and a tree-based algorithm (SFM-XGBoost). This group also tested five ML methods, together with stacking of base classifiers, during the development of their peptide predictor. This expansion of the analyzed characteristics brought some benefits in improving the accuracy of ACVP prediction. While all of the above approaches achieved reasonable performance in tests, they still fail to meet the needs of drug development. The fact that only a relatively limited number of experimentally validated ACVP sequences were available for model training and development led to an imbalanced benchmark dataset. To obtain more information about relevant sequence features, the use of additional computational frameworks, including deep learning and natural language processing, has been suggested [26][27][28][29].
Deep learning uncovers the underlying patterns in large datasets by learning hierarchical representations of sample data, replicating human analytical and learning abilities through artificial intelligence. A range of deep learning-based computational approaches has been used widely in bioinformatics, allowing machines to identify features of protein or peptide sequences and transform these to represent relevant information about the object of interest [30][31][32][33][34][35][36][37]. Effective sequence representation approaches include the soft symmetric alignment (SSA) mechanism [38], unified representation (UniRep) [38][39][40], BiLSTM [41] and bidirectional encoder representations from transformers (BERT) [42].
In this study, we describe the development of FEOpti-ACVP, a novel approach to predict ACVPs that is based on feature engineering (FE) optimization. Since the existing benchmark dataset of identified ACVPs is imbalanced, the synthetic minority over-sampling technique (SMOTE) was utilized to balance the data and improve peptide representation. The resulting feature vectors and their fused representations, optimized by an embedded feature selection method, were entered into seven ML models to construct the new ACVP predictor. In addition, we took advantage of the uniform manifold approximation and projection (UMAP) algorithm to visualize the optimized feature vectors and to compare model performance. The results showed that FEOpti-ACVP demonstrated superior performance for the prediction of ACVPs in both 5-fold cross-validation (ACC = 0.992, MCC = 0.984, Sn = 0.998, Sp = 0.986, auROC = 1.000, auPRC = 1.000 and F1 = 0.992) and independent tests (ACC = 0.967, MCC = 0.742, Sn = 0.656, Sp = 0.992, auROC = 0.927, auPRC = 0.729 and F1 = 0.789). In comparison with existing prediction models, FEOpti-ACVP showed better overall performance and has the potential to be applied widely in the discovery of novel ACVP drugs.

Benchmark dataset
To facilitate comparisons, the model described in this paper was developed using the latest dataset from PACVP [25,43,44]. This ACVP-2141 benchmark dataset contains 157 ACVPs and 1984 peptides with no demonstrable anti-coronaviral activity (named non-ACVPs). The dataset was 'de-homologized' by removing sequences with >90% identity. For the purposes of training and testing our model, this dataset was arbitrarily partitioned at a 4:1 ratio to construct the training dataset, ACVP-TR, and the test dataset, ACVP-IND.
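For illustration, the partitioning step can be sketched in Python as follows. The stratified split and the random seed are our assumptions, since the paper only states an arbitrary 4:1 partition; `X` and `y` denote the peptide representations and their binary labels:

```python
from sklearn.model_selection import train_test_split

# Sketch: split ACVP-2141 at a 4:1 ratio into ACVP-TR (training) and
# ACVP-IND (independent test). Stratifying on y keeps the ACVP/non-ACVP
# ratio comparable in both sets (an assumption, not stated in the paper).
X_tr, X_ind, y_tr, y_ind = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```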

Overview of FEOpti-ACVP
The development strategy of FEOpti-ACVP is outlined in Figure 1. The process involved the construction of the benchmark dataset, feature extraction, synthesis of features to balance the dataset, feature selection, ML model training, 5-fold cross-validation, independent testing and, finally, construction of the FEOpti-ACVP webserver.

Synthetic minority over-sampling technique
As mentioned in Section 2.1, the negative dataset is ∼13 times larger than the positive dataset. One approach to addressing such class imbalance is SMOTE [45]. For each sample x in the minority class, its Euclidean distance to all other samples in that class is calculated. A parameter k then determines the number of nearest neighbors to consider. From these neighbors, one (for example, y) is chosen at random to generate a new sample z according to

$$z = x + \lambda \cdot (y - x),$$

where λ is a random number between 0 and 1. This process is repeated until the desired balance between the minority and majority classes is achieved.
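A minimal Python sketch of this procedure is given below; in practice, the equivalent `SMOTE` class from the imbalanced-learn package (`SMOTE(k_neighbors=5).fit_resample(X, y)`) performs the same interpolation. `X_min` is assumed to be the matrix of minority-class (ACVP) feature vectors:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE interpolation."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        d = np.linalg.norm(X_min - x, axis=1)   # Euclidean distances to all samples
        neighbors = np.argsort(d)[1:k + 1]      # k nearest neighbors, excluding x itself
        y = X_min[rng.choice(neighbors)]        # pick one neighbor at random
        lam = rng.random()                      # λ ∈ [0, 1)
        synthetic.append(x + lam * (y - x))     # z = x + λ(y − x)
    return np.vstack(synthetic)
```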

Feature extraction
In the pre-training stage of model development, feature extraction is an effective way of strengthening data representation, since the new features obtained are mapped from the original sequence features. We used both the UniRep and BERT frameworks to extract features and pre-train our model on the ACVP-2141 dataset. The feature vectors produced by the two distinct algorithms were then used as input for ML model training and comparison.

The UniRep feature extractor
Alley et al. [46] developed UniRep, a deep learning-based sequence representation method built on a unidirectional multiplicative LSTM (mLSTM) with 1900 hidden units. The authors used a recurrent neural network (RNN) to compile the statistical representation of proteins, resulting in a unified representation that encodes arbitrary protein sequences as fixed-length vectors approximating underlying protein features. Although the fixed hidden state of RNNs can result in a bottleneck, the UniRep approach demonstrated that RNNs could adequately use raw sequence data for the task of training an algorithm to predict the next amino acid in a sequence. Thus, using UniRep could alleviate the data scarcity that previously constrained protein informatics.
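As a sketch of how such fixed-length vectors can be obtained, the community jax-unirep reimplementation exposes a `get_reps` function; the choice of this implementation is our assumption (the paper does not state which UniRep tooling it used), and the peptide shown is an arbitrary example:

```python
from jax_unirep import get_reps  # JAX reimplementation of UniRep

sequences = ["GLFDIVKKVVGALGSL"]   # hypothetical example peptide
h_avg, h_final, c_final = get_reps(sequences)
print(h_avg.shape)                 # (1, 1900): one 1900D vector per peptide
```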

The BERT feature extractor
Devlin et al. [47,48] introduced BERT, a pre-trained language representation model. It uses a masked language model (MLM) objective to pre-train bidirectional transformers, in which each token attends to all other tokens. The main framework of BERT is created by stacking multiple transformer layers. The model ultimately provides a deep bidirectional linguistic representation capable of incorporating contextual information from both directions.
First, the ACVP sequence is converted into a k-mer token representation, and positional information is added to each token to obtain the model input. The contextual semantics are then captured by multi-head self-attention, followed by a linear transformation, which completes the forward propagation of one layer. After passing through multiple such layers (as shown in Figure 1), the final outputs are used for the BERT pre-training task.
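A hedged sketch of this embedding pipeline is shown below. The checkpoint name is a placeholder (the paper does not name its BERT weights), and k = 3 tokenization and mean pooling of the final hidden states into a single 768D vector are our assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def kmers(seq: str, k: int = 3) -> str:
    """Overlapping k-mer tokens, e.g. 'GLFDI' -> 'GLF LFD FDI'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# 'path/to/peptide-bert' is a placeholder checkpoint, not one named in the paper.
tokenizer = AutoTokenizer.from_pretrained("path/to/peptide-bert")
model = AutoModel.from_pretrained("path/to/peptide-bert")
model.eval()

def bert_embed(seq: str) -> torch.Tensor:
    inputs = tokenizer(kmers(seq), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_tokens, 768) for BERT-base
    return hidden[0, 1:-1].mean(dim=0)               # mean-pool, dropping [CLS]/[SEP]
```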

Feature selection
Feature selection, an indispensable step in building predictive models, is the process by which significant features are identified and retained, while redundant or uninformative features are discarded. This process alleviates the issue of excessive dimensionality and reduces the risk of overfitting, thus improving model performance. Filter, wrapper and embedded methods are the three main approaches to feature selection. During the creation of our model, we employed an embedded method to obtain feature importance and perform feature selection while the model was being trained. We used a light gradient boosting machine (LGBM) to construct an iterative boosting tree model [49] over the original features, producing a weight coefficient for each feature. Features were then ranked by this weight coefficient from largest to smallest and selected accordingly.
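A minimal sketch of this embedded selection step, assuming NumPy feature matrices and default LGBM hyperparameters:

```python
import numpy as np
from lightgbm import LGBMClassifier

def lgbm_select(X, y, n_keep=128, seed=0):
    """Rank features by LGBM importance and keep the top n_keep."""
    clf = LGBMClassifier(random_state=seed).fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]  # largest weight first
    keep = order[:n_keep]
    return X[:, keep], keep
```

Here, `n_keep=128` mirrors the 128D UniRep feature set retained in the final model.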

Machine learning
Supervised learning is a training technique in ML where the algorithm uses a dataset of known categories to adjust classifier parameters and build a learning model that achieves the target performance. The categorization of new instances is then predicted based on this trained model. Regardless of learning style or function, all ML algorithms involve the following elements: representation, evaluation and optimization. In this paper, seven supervised learning algorithms were tested and their performance compared to select the best option for the final model. The tested algorithms included LGBM [41], RF [33], SVM [29], k-nearest neighbors (KNN) [42], logistic regression (LR), latent Dirichlet allocation (LDA) [38] and naive Bayes (NB) [50], all of which have proven effective in other peptide/protein prediction tasks.
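The comparison can be sketched as below with scikit-learn and LightGBM. Default hyperparameters are an assumption, and 'LDA' is instantiated here as scikit-learn's linear discriminant analysis classifier, the usual supervised classifier behind this abbreviation (latent Dirichlet allocation itself is not a classifier); `X_tr` and `y_tr` are the training features and labels from the earlier sketches:

```python
from lightgbm import LGBMClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "LGBM": LGBMClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
}
for name, clf in models.items():
    acc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean 5-fold ACC = {acc:.3f}")
```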

Performance measures
During model development, we relied on seven metrics commonly used to evaluate performance: accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (Sn), specificity (Sp), F1-score (F1), area under the receiver operating characteristic curve (auROC) and area under the precision-recall curve (auPRC) [51,52]. F1 is the harmonic mean of precision and recall. In general, auPRC and F1 are also considered objective measures of the performance of models developed using imbalanced datasets [53]. The formulas used to derive these metrics are as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\mathrm{F1} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

where TP (true positive) is the number of correctly identified ACVPs, TN (true negative) is the number of correctly identified non-ACVPs, FP (false positive) is the number of ineffective peptides incorrectly classified as ACVPs and FN (false negative) is the number of ACVPs incorrectly identified as non-ACVPs. We employed 5-fold cross-validation and independent tests to evaluate the models. For cross-validation, the ACVP-TR dataset was subdivided into five subsets; four of these were used for training, while the remaining subset was used for testing. The average of the five runs is reported as the final 5-fold cross-validation result [54,55]. Independent testing used the ACVP-IND subset, which does not overlap with the training dataset; thus, samples used for independent testing were new to the model.
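These metrics can be computed with scikit-learn as in the sketch below; approximating auPRC by average precision is a common choice and an assumption on our part:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, matthews_corrcoef,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Compute the seven metrics used in this study."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Sn": tp / (tp + fn),                      # sensitivity (recall)
        "Sp": tn / (tn + fp),                      # specificity
        "F1": f1_score(y_true, y_pred),
        "auROC": roc_auc_score(y_true, y_score),   # y_score: class-1 probabilities
        "auPRC": average_precision_score(y_true, y_score),
    }
```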

SMOTE expanded dataset promotes better model performance
Since the number of peptides with confirmed ACVP activity is currently limited, any training dataset is skewed by the overrepresentation of peptides with no activity, causing an imbalance (Figure 4A-C). To improve the effectiveness of the pre-trained model, we introduced SMOTE to improve the balance of the dataset (Figure 4D-F). The results of the 5-fold cross-validation and independent testing achieved using the SMOTE-generated dataset are shown in Supplementary Table 1, while the visualization of feature representations by UMAP is shown in Figure 4.
As shown in Figure 2, we ranked the averages of the 5-fold cross-validation and independent test results according to different metrics, including ACC, MCC, Sn and auPRC. Sn (recall) and auPRC are recognized as informative metrics for evaluating imbalanced data. In the top 14 rankings of ACC, MCC, Sn and auPRC, SMOTE-based models accounted for 78.57%, 92.86%, 85.71% and 85.71% of entries, respectively. Therefore, introducing SMOTE as a balancing strategy was a powerful way to improve model performance. The following sections discuss data based on this balancing strategy.

The effect of feature fusion on prediction performance
To obtain more representative feature information, we fused the 1900D UniRep feature vectors with the 768D BERT feature vectors, obtaining a 2668D combined UniRep+BERT feature vector set, and tested the predictive performance of all three vector sets. The results of the 5-fold cross-validation and independent testing are shown in Table 1.
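The fusion itself is simple column-wise concatenation; a one-line sketch, assuming `X_unirep` and `X_bert` are the per-peptide feature matrices from the two extraction sketches above:

```python
import numpy as np

# (n, 1900) UniRep + (n, 768) BERT -> (n, 2668) fused representation
X_fused = np.hstack([X_unirep, X_bert])
```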
In short, although there was a small improvement in some of the metrics when the fused vector set was used as input, the overall effect was not significant. For example, the fused feature model analyzed by LR outperformed the other vector sets on all 5-fold cross-validation metrics and some of the independent test metrics (Sn: 4.55–15.01%, auROC: 0.20–6.36%, auPRC: 0.73–36.72%, F1: 2.38–10.49%). However, as is apparent from Table 1, the high-dimensional fused features are likely to contain redundant information, and with most ML algorithms the fusion did not improve prediction performance but rather degraded it.

Feature selection can improve predictive performance
Next, we sorted the features according to their importance using the LGBM algorithm. This step eliminated redundant and irrelevant features, while retaining significant ones. The results of 5-fold cross-validation and independent testing of models using the selected features are shown in Table 2. We also compared the developed models as shown in Figure 3, where the metrics represent the averages of the independent test and 5-fold cross-validation results.
As these results indicate, irrespective of the ML method used, model performance improved with the use of selected features. For example, when using LGBM (Figure 3A), the UniRep-128D model outperformed the UniRep-1900D model in 85.7% of the metrics (ACC: 0.14%, MCC: 2.35%, Sn: 3.88%, auROC: 0.27%, auPRC: 0.75%, F1: 2.61%). Thus, feature selection was an effective way of identifying important features and enhancing the performance of the ACVP prediction model.

Visualization of feature representations
UMAP models the manifold with a fuzzy topology and finds embeddings by searching for low-dimensional projections of the data with the closest equivalent topology, preserving the global structure of the data as much as possible. The visualization of the main feature models developed in this paper is shown in Figure 4.
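A minimal sketch of such a visualization with the umap-learn package, assuming `X` holds the feature vectors and `y` the binary ACVP labels:

```python
import matplotlib.pyplot as plt
import numpy as np
import umap  # pip install umap-learn

emb = umap.UMAP(n_components=2, random_state=42).fit_transform(X)  # (n, 2)
y = np.asarray(y)
plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=8, label="non-ACVP")
plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=8, label="ACVP")
plt.legend()
plt.show()
```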
Since, as mentioned above, our dataset was imbalanced (Figure 4B and C), feature models built on it can be misleading. Therefore, we performed SMOTE-based data augmentation on the feature models (Figure 4D-F) so that non-ACVP and ACVP samples were not entirely isolated during visualization. As can be seen in Figure 4H and I, the selected features separated non-ACVP and ACVP sequences better than the high-dimensional features without selection (Figure 4D-F).

Comparing FEOpti-ACVP with previously reported prediction strategies
In its final implementation, selected on the basis of accuracy in the independent test, FEOpti-ACVP uses the top 128-dimensional UniRep features combined with ML analysis using the LGBM algorithm. The average results of the 5-fold cross-validation and independent test for all SMOTE-balanced models are shown in Supplementary Table 2. As a final test, we compared the performance of this optimized version of FEOpti-ACVP with existing ACVP predictors, including PACVP [25], iACVP [20] and ENNAVIA-D [18]. The results of these comparisons using the dedicated test peptide dataset are shown in Table 3, while the 5-fold cross-validation data are presented in Supplementary Table 3.

CONCLUSION
This report describes the development of FEOpti-ACVP, a prediction model capable of identifying potential novel ACVPs. The results showed that feature extraction based on deep representation learning was an effective strategy. The use of selected features, detected via the LGBM algorithm, represented an essential step in model building. The improved performance of the developed model following SMOTE-based data balancing was also evident, as shown by the dimensionality reduction visualization. In the final implementation, we chose to use the top 128-dimensional features from UniRep and combined these with the LGBM-based ML algorithm to construct the final prediction model. Finally, when compared with existing ACVP prediction tools, FEOpti-ACVP had a superior ability to predict and identify effective ACVPs, achieving an ACC of 0.967 and a Sp of 0.992 on the independent test dataset. To make this tool available to the scientific community, we constructed a web server for FEOpti-ACVP (http://servers.aibiochem.net/soft/FEOpti-ACVP/).
Here, the user only needs to enter the peptide sequence and click the run button to predict the probability of a peptide being a potential ACVP drug candidate.In addition, we hope that this prediction framework will have broader applications in bioinformatics.

Figure 1. Schematic diagram of model development. (A) Construction and partitioning of the benchmark dataset. (B) Feature vector extraction using the pre-trained UniRep and BERT sequence embedding models. This extraction was conducted twice, creating either the 1900-dimensional (1900D) UniRep feature vectors or the 768-dimensional BERT feature vectors. These were also fused to obtain the 2668-dimensional UniRep+BERT feature vectors. (C) The SMOTE model was used to balance the heavily skewed representation of ACVP and non-ACVP sequences in the original dataset. (D) Using the LGBM algorithm, the extracted feature vectors were ranked according to their importance to eliminate redundant features. (E) Testing of the feature-selected dataset as input into the seven tested ML methods: LGBM, SVM, LR, RF, LDA, KNN and NB. (F) Evaluation of the best-performing model based on the metrics achieved using the independent test sequence dataset. (G) Development of the FEOpti-ACVP web server based on the final optimized model.

Figure 2. Top 14 entries of the averaged 5-fold cross-validation and independent test results from runs conducted with and without SMOTE-based data balancing. Performance metrics were ranked from best to worst after completing a total of 42 runs.

Figure 3. Comparison of model performance using the full or feature-selected feature vectors. The results were obtained with seven machine learning models: (A) LGBM, (B) SVM, (C) RF, (D) LR, (E) KNN, (F) LDA and (G) NB. Numbers indicate the average of model performance across the 5-fold cross-validation and independent test results.

Table 1: Results of the 5-fold cross-validation and independent testing of the three feature models developed using UniRep and BERT feature extraction. The ML algorithms used are indicated on the right.
a The ML algorithms were: LGBM: light gradient boosting machine; SVM: support vector machine; RF: random forest; LR: logistic regression; KNN: k-nearest neighbors; LDA: latent Dirichlet allocation; NB: naive Bayes.
b Feature vectors were created using UniRep: unified representation (1900D); BERT: bidirectional encoder representations from transformers (768D); UniRep+BERT: fused UniRep and BERT feature vector (2668D).
c Best performance values using the same ML model are shown in bold.
d Best overall performance values are shown in bold and underlined.

Table 2: Results of the 5-fold cross-validation and independent testing of the feature-selected feature vectors extracted by UniRep and BERT.

a The ML algorithms were: LGBM: light gradient boosting machine; SVM: support vector machine; RF: random forest; LR: logistic regression; KNN: k-nearest neighbors; LDA: latent Dirichlet allocation; NB: naive Bayes.
b Features were extracted using UniRep: unified representation; BERT: bidirectional encoder representations from transformers; UniRep+BERT: fused UniRep and BERT feature vectors. The number of features (Dim column) was selected using LGBM.
c Best performance values are shown in bold and underlined.